Abstract

We propose a simple method to predict individuals' expectations about products using a knowledge
network. As a complementary result, we show that the method is able, under certain conditions,
to extract hidden information at the neural level from a database of customers' choices.

1 Introduction 

Personal tastes are universally considered very difficult to analyze (an old Latin proverb
states "De gustibus non disputandum est", i.e. "There's no accounting for taste"). Nevertheless,
there is evidence of certain regularities in personal preferences, which allow people to successfully
choose Christmas presents for friends and relatives.

Indeed, any model of market behavior assumes rational agents that choose among the possible
options on a quantitative basis. In a famous paper [1], Stigler and Becker argued that all
people have fixed tastes except for small variations, and that the different patterns in taste investments
(like buying new music discs) are computable from the expected revenues (i.e. the forecast
enjoyment of future music exposure). Many people have argued that this purely economic point of view
ignores the "enormous role of historical and cultural forces, education, and values, as the initial
shapers of our preferences".

Anyhow, in order to test any economic, psychological or moral point of view about tastes, we need a
quantitative model of taste formation and, more importantly, of taste anticipation based on past
experience.

The starting point of our analysis is that the opinion of an agent on a given product is formed by
the match between the agent's set of preferences/tastes and the product's qualities.

While many commercial studies are based on surveys about customers' preferences, we assume that
both preferences and qualities are hidden degrees of freedom, and that only the expressed opinion is
observable. One of the goals of our study is to develop techniques able to extract information about
the hidden parts from the correlations among agents' opinions on products.

Let us suppose that there exists a database of agents' opinions on a given set of products. This 
database can be seen as a sparse matrix, with holes corresponding to missing opinions (say, agents 
that have never been exposed to a given product). 



In geometrical terms, one represents an agent's preferences as a vector in a hypothetical taste space,
whose dimension and basis vectors are unknown. A product is represented by a similar vector (in the dual
space). The agent's opinion on a given product is given by an operation analogous to the scalar product
between preferences and properties. Therefore, products act like a basis, and opinions as the agent's
coordinates on such a basis. However, unlike in usual geometrical problems, we do not know
what the basis is, whether it is complete, etc.

As we shall recall in Section 2, S. Maslov and Y.C. Zhang have shown that it is possible, if we know
the basis of the agents' preferences, to reconstruct the vectors of the individual tastes from the knowledge
of a sparsely connected network of the overlaps (scalar products) among preferences [2]. We want to
extend this result to the more usual case in which basis information is not at our disposal, as discussed
in Section 3.

One of the outcomes of our analysis is the possibility of opinion anticipation, i.e. the possibility
of exploiting the correlations in the database to forecast the missing opinions. Alternatively, we can
obtain information about the overlap of tastes between two individuals from the knowledge of their
expressed opinions.

What we consider our main result is the possibility of extracting information about the hidden
degrees of freedom, and in particular the dimensionality of the hidden space, from the opinion database.
In this way customers' commercial interests can be used as tools of cognitive psychology.

As we shall discuss in Section 4, the sparseness of data and a bias in the database can be included
in the model.

The results of the comparisons between the theory and numerical simulations over randomly
generated data are presented in Section 5.

Finally, in Section 6 we summarize our work and draw some conclusions.

2 The Model 

We consider a population of $M$ individuals interacting with a set of $N$ products. We assume that each
product is characterized by an $L$-dimensional array $a = (a^{(1)}, a^{(2)}, \dots, a^{(L)})$ of features, while each
individual has the corresponding list of $L$ personal tastes on the same features, $b = (b^{(1)}, b^{(2)}, \dots, b^{(L)})$.
For numerical simulations we have chosen both $a^{(l)}$ and $b^{(l)}$ in the set $\{-1, 1\}$.

The opinion of individual $m$ on product $n$, denoted by $s_{m,n}$, is defined to be proportional to the scalar
product between $b_m$ and $a_n$: $s_{m,n} = \lambda(L)\, b_m \cdot a_n$, where $\lambda(L)$ is a suitably chosen normalization factor.
In general, $\lambda(L)$ should scale as $L^{-1}$ and depend on the ranges of $a$ and $b$. For our choice of hidden
parameters, we use $\lambda(L) = 1/L$, so that $s_{m,n}$ lies in the interval $[-1, 1]$.
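As a minimal sketch of this definition, the opinion matrix for random $\pm 1$ features and tastes with $\lambda(L) = 1/L$ could be generated as follows. The function name `make_opinions`, the array layout (one column per product or individual), the seed and the sizes are illustrative assumptions, not part of the model description.

```python
import numpy as np

def make_opinions(M, N, L, rng):
    """Return (A, B, S) for random +/-1 hidden components.

    A has shape (L, N): column n holds the features a_n of product n.
    B has shape (L, M): column m holds the tastes b_m of individual m.
    S has shape (M, N): S[m, n] = (1/L) * b_m . a_n, so it lies in [-1, 1].
    """
    A = rng.choice([-1, 1], size=(L, N))
    B = rng.choice([-1, 1], size=(L, M))
    S = (B.T @ A) / L  # lambda(L) = 1/L
    return A, B, S

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
A, B, S = make_opinions(M=5, N=7, L=10, rng=rng)
print(S.shape, S.min(), S.max())
```

With this normalization an opinion of $+1$ means that tastes and features agree on every component, and $-1$ that they disagree on every component.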

In order to predict whether person $j$ will like or dislike a certain product $n$, assuming that
$a_n$ is known, it is sufficient to predict the individual tastes of person $j$, i.e. the vector $b_j$.

The similarity between the tastes of two individuals $i$ and $j$ is defined by the overlap $\Omega_{ij} = b_i \cdot b_j$
between the individual tastes $b_i$ and $b_j$.

One can build a knowledge network among people, using the vectors $b_m$ as nodes and the overlaps
$\Omega_{ij}$ as edges. Maslov and Zhang [2] (MZ) assume that a fraction $p$ of these overlaps is known. They
show that there are two important thresholds for $p$ that determine whether the missing
information can be reconstructed.

The first one is a percolation threshold, reached when the fraction of edges $p$ is greater than
$p_1 = 1/(M - 1)$, where $M$ is the number of people. This means that there must be at least one path
between two randomly chosen nodes in order to be able to predict the second node starting from the
first one.

Since the vectors $b_m$ lie in an $L$-dimensional space, and a single link "kills" only one degree of freedom, a
reliable prediction needs more than one path connecting two individuals. Maslov and Zhang show that
there is a "rigidity" threshold $p_2$, of the order of $2L/M$, such that for $p > p_2$ the mutual orientations
of the vectors in the network are fixed, and the knowledge of the preferences of just one person is sufficient to
know those of all the other individuals.
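To give a feeling for the orders of magnitude involved, the two thresholds can be evaluated for a given population. The helper name `mz_thresholds` and the sample values of $M$ and $L$ are hypothetical; the expressions assume $p_1 = 1/(M-1)$ and $p_2 \approx 2L/M$ as quoted above.

```python
def mz_thresholds(M, L):
    """Percolation and rigidity thresholds as quoted in the text."""
    p1 = 1.0 / (M - 1)   # percolation threshold
    p2 = 2.0 * L / M     # rigidity threshold (order of magnitude only)
    return p1, p2

p1, p2 = mz_thresholds(M=1000, L=10)
print(p1, p2)
```

For any $L \geq 1$ the rigidity threshold lies above the percolation threshold, and it grows linearly with the dimension of the taste space.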
3 Extracting information from hidden quantities



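In general one does not have access to individuals' preferences, nor does one know the dimensionality $L$
of this space. In order to address this problem, let us define the opinion correlation matrix $C$:

$$C_{ij} = \frac{\sum_{n=1}^{N} (s_{i,n} - \bar{s}_i)(s_{j,n} - \bar{s}_j)}{\sqrt{\sum_{n=1}^{N} (s_{i,n} - \bar{s}_i)^2 \, \sum_{n=1}^{N} (s_{j,n} - \bar{s}_j)^2}}, \tag{1}$$

where $\bar{s}_i$ is the average of the opinions of agent $i$ over the $N$ products (row $i$ of the opinion matrix $S$).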
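Eq. (1) is the Pearson correlation between the opinion lists of two agents. A minimal sketch of its computation, assuming the opinions are stored with one row per agent and one column per product (the function name `opinion_correlation` and the toy data are illustrative):

```python
import numpy as np

def opinion_correlation(S):
    """Pearson correlation between the opinion rows of every pair of agents.

    S has shape (M, N), one row per agent; the result has shape (M, M).
    An agent whose opinions are all identical would give a zero norm and
    must be handled separately.
    """
    D = S - S.mean(axis=1, keepdims=True)    # subtract each agent's mean opinion
    norms = np.sqrt((D ** 2).sum(axis=1))    # per-agent normalization
    return (D @ D.T) / np.outer(norms, norms)

rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(4, 50))   # toy opinions, illustrative only
C = opinion_correlation(S)
print(np.round(C, 3))
```

The same matrix is returned by `np.corrcoef(S)`, which is the shortcut used whenever all opinions are known.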

We show below that one can compute an accurate anticipation $\tilde{s}_{m,n}$ of the true opinion $s_{m,n}$
using the formula

$$\tilde{s}_{m,n} = \frac{k}{M} \sum_{i=1}^{M} C_{m,i}\, s_{i,n}, \tag{2}$$

where $k$ is a factor that in general depends on $L$ and on the statistical properties of the hidden
components. However, it will be shown that if the components of $a_n$ and $b_m$ are independent random
variables, $k$ is independent of $n$ and $m$, so it can simply be chosen so that $\tilde{s}_{m,n}$ is defined over
the same interval as $s_{m,n}$.
For instance, if we define

$$s^*_{m,n} = \frac{1}{M} \sum_{i=1}^{M} C_{m,i}\, s_{i,n} \tag{3}$$

and $s^*_{\max} = \max_{m,n} |s^*_{m,n}|$, then in order to keep estimations in the range $[-1, 1]$, $k = 1/s^*_{\max}$. As we
shall illustrate in the following, from this estimation of $k$ we can get information about the dimension $L$
of the hidden space.
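The anticipation formula and the data-driven choice of $k$ can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the data are random $\pm 1$ components with $\lambda(L) = 1/L$, all opinions are known, and the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M, N = 5, 400, 400
A = rng.choice([-1, 1], size=(L, N))   # product features (hidden)
B = rng.choice([-1, 1], size=(L, M))   # individual tastes (hidden)
S = (B.T @ A) / L                      # observable opinions, shape (M, N)

C = np.corrcoef(S)                     # opinion correlation matrix, Eq. (1)
S_star = (C @ S) / M                   # unnormalized estimate
k = 1.0 / np.abs(S_star).max()         # keeps the estimates inside [-1, 1]
S_tilde = k * S_star                   # anticipated opinions, Eq. (2)

rmse = np.sqrt(np.mean((S_tilde - S) ** 2))
print(k, rmse)
```

Because the maximum is taken over a finite and noisy sample, the value of $k$ obtained this way should be read as a rough indicator of the order of magnitude of $L$ rather than as an exact estimate.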

We now justify the proposed formulas for the case in which the components $a_{n,l}$ of $a_n$ and $b_{m,l}$ of $b_m$
($l = 1, \dots, L$) are independent random variables distributed according to

$$P(a_{n,l}, b_{m,l}) = P_{n,l}(a)\, P_{m,l}(b). \tag{4}$$

Averages over (4) of any function $h(a_{n,l}, b_{m,l})$ are given by

$$\langle h \rangle = \sum_{a,b=-\infty}^{\infty} h(a, b)\, P_{n,l}(a)\, P_{m,l}(b). \tag{5}$$

For a set of hidden components distributed according to (4), the opinions are uncorrelated in the
thermodynamic limit. However, the idea is that the system presents fluctuations mainly because $L$ is
finite, so correlations between opinions arise and can be used to predict unknown opinions. In order
to keep the algebra simple, the discussion will be made for the case in which the variables $a_{n,l}$ and $b_{m,l}$
have zero mean. At the end a generalization to biased components will be given.
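The components can be written in matrix form as

$$A = \begin{pmatrix} a_{1,1} & \cdots & a_{N,1} \\ \vdots & \ddots & \vdots \\ a_{1,L} & \cdots & a_{N,L} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{1,1} & \cdots & b_{M,1} \\ \vdots & \ddots & \vdots \\ b_{1,L} & \cdots & b_{M,L} \end{pmatrix}, \tag{6}$$

so the opinion matrix is defined by

$$S = \lambda(L)\, B^T A, \tag{7}$$

where $\lambda(L)$ is the normalization constant.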
The opinion correlation matrix is essentially equivalent to

$$C = \frac{S S^T}{N \langle s^2 \rangle}, \tag{8}$$

where $\langle s^2 \rangle$ denotes the average of $s^2$ over $P_{n,l}(a) P_{m,l}(b)$. Because of the finite size of the system, there
are differences between the normalization factors in definitions (1) and (8). These differences are small
for large $N$ and $L$, and we neglect them at this point because they give non-dominant contributions
to errors in the final expressions. An element of the opinion matrix $S$ is expressed by the scalar
product

$$s_{m,n} = \lambda(L) \sum_{l=1}^{L} b_{m,l}\, a_{n,l}, \tag{9}$$

so averaging over the distribution $P_{n,l}(a) P_{m,l}(b)$ gives

$$\langle s^2 \rangle = \lambda^2(L)\, L\, \langle a^2 \rangle \langle b^2 \rangle. \tag{10}$$

Using (10), the correlation matrix can be written as

$$C = \frac{B^T A A^T B}{L N \langle a^2 \rangle \langle b^2 \rangle}. \tag{11}$$

Let us now consider the expression

$$\frac{1}{M} C S = \frac{\lambda(L)\, B^T A A^T B B^T A}{M L N \langle a^2 \rangle \langle b^2 \rangle}. \tag{12}$$

If $N$ and $M$ are large, the central limit theorem can be applied to the following matrix products:

$$A A^T = \langle a^2 \rangle \begin{pmatrix} N + O(\sqrt{N}) & \cdots & O(1) \\ \vdots & \ddots & \vdots \\ O(1) & \cdots & N + O(\sqrt{N}) \end{pmatrix}, \tag{13}$$

$$B B^T = \langle b^2 \rangle \begin{pmatrix} M + O(\sqrt{M}) & \cdots & O(1) \\ \vdots & \ddots & \vdots \\ O(1) & \cdots & M + O(\sqrt{M}) \end{pmatrix}. \tag{14}$$
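The statement that $A A^T$ is dominated by its diagonal for large $N$ can be checked numerically. The following sketch assumes $\pm 1$ components, for which $\langle a^2 \rangle = 1$; the sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 6, 10_000
A = rng.choice([-1, 1], size=(L, N))   # <a^2> = 1 for +/-1 components
G = (A @ A.T) / N                      # should be close to the L x L identity
off_diag = G[~np.eye(L, dtype=bool)]   # the L*(L-1) off-diagonal entries
print(np.diag(G), np.abs(off_diag).max())
```

After division by $N$ the diagonal equals one exactly for $\pm 1$ components, while the off-diagonal entries are small compared with it, so that $A A^T \approx N \langle a^2 \rangle I$ to leading order. The same check applies to $B B^T$ with $M$ in place of $N$.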



Introducing Eqs. (13) and (14) into Eq. (12) we obtain

$$\frac{1}{M} C S = \frac{S}{L} + O\!\left( \frac{L-1}{L} \left[ \frac{1}{\sqrt{M}} + \frac{1}{\sqrt{N}} \right] \right). \tag{15}$$



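For large values of $N$ and $M$, by comparing Eq. (15) with Eq. (2), we can identify the factor $k$ with
the number of components $L$, and obtain an estimate for the average prediction error

$$\varepsilon \simeq \gamma L^{3/2} \left[ \frac{1}{\sqrt{M}} + \frac{1}{\sqrt{N}} \right], \tag{16}$$

where

$$\gamma = \lambda(L) \sqrt{\langle a^2 \rangle \langle b^2 \rangle}. \tag{17}$$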

Formula (16) implies that the predictive power of Eq. (2) grows with $MN$ and diminishes with $L$. This
fact is a consequence of the decay of the correlations among opinions with $L$, so that more
information is needed in order to perform a prediction as $L$ grows. This condition can be compared
with the "rigidity" threshold $p_2$ in the MZ analysis.
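The dependence of the prediction error on the amount of available data can be probed with a small experiment. This sketch assumes fully known opinions, random $\pm 1$ components, $\lambda(L) = 1/L$ and the identification $k = L$ made above; the function name `prediction_rmse` and all sizes are illustrative.

```python
import numpy as np

def prediction_rmse(M, N, L, rng):
    """Root-mean-square distance between anticipated and true opinions."""
    A = rng.choice([-1, 1], size=(L, N))
    B = rng.choice([-1, 1], size=(L, M))
    S = (B.T @ A) / L
    C = np.corrcoef(S)
    S_tilde = (L / M) * (C @ S)        # Eq. (2) with k = L
    return np.sqrt(np.mean((S_tilde - S) ** 2))

rng = np.random.default_rng(4)
errors = {N: prediction_rmse(M=400, N=N, L=8, rng=rng) for N in (50, 400, 3200)}
print(errors)
```

Since each value comes from a single random realization, only the overall trend (a smaller error for a much larger number of products) should be read from the output.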
4 Sparse and biased data



In the real world one cannot expect to have a fully connected opinion matrix at one's disposal. Indeed,
one of the most important features of an anticipation system is its hole-filling capability.

One can extend the previous formalism to sparse datasets by considering the parameters $M$, $N$
as functions of the individual/product pair $(m, n)$ in the following way: $M_n$ represents the
number of available opinions on product $n$ given by any agent, and $N_m$ is the number of opinions expressed by
agent $m$ about any product. Using formula (2) with the redefined parameters $M_n$ and $N_m$, it follows
from Eq. (15) that an unknown opinion $s_{m,n}$ can be estimated with an accuracy that scales as

$$|\tilde{s}_{m,n} - s_{m,n}| \sim \gamma L^{3/2} \left[ \frac{1}{\sqrt{M_n}} + \frac{1}{\sqrt{N_m}} \right] \tag{18}$$

for large values of $N_m$, $M_n$ and $L$.
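One possible way to apply the estimate to a matrix with holes is sketched below. The text does not specify how the correlation matrix is computed when opinions are missing, so the choice made here (correlating two agents only over the products that both have rated, and averaging only over the $M_n$ agents who rated product $n$) is an assumption, as are the function name `predict_sparse` and the test data.

```python
import numpy as np

def predict_sparse(S, known, k):
    """Estimate all opinions from the known ones only.

    S     : (M, N) opinions; entries where `known` is False are ignored.
    known : (M, N) boolean mask of available opinions.
    k     : normalization factor of Eq. (2).
    Assumes every agent and every product has at least one known opinion
    and every pair of agents shares at least two rated products.
    """
    W = known.astype(float)
    S0 = np.where(known, S, 0.0)
    N_m = W.sum(axis=1, keepdims=True)           # opinions expressed by agent m
    mean = S0.sum(axis=1, keepdims=True) / N_m   # per-agent mean over known entries
    D = (S0 - mean) * W                          # centered, zero where unknown
    num = D @ D.T                                # sums over co-rated products only
    den = np.sqrt(((D ** 2) @ W.T) * (W @ (D ** 2).T))
    C = num / den
    M_n = W.sum(axis=0, keepdims=True)           # opinions available on product n
    return k * (C @ S0) / M_n

rng = np.random.default_rng(5)
L, M, N = 5, 300, 300
A = rng.choice([-1, 1], size=(L, N))
B = rng.choice([-1, 1], size=(L, M))
S = (B.T @ A) / L
known = rng.random((M, N)) < 0.5                 # about half of the opinions are known
S_tilde = predict_sparse(S, known, k=L)
hidden_corr = np.corrcoef(S_tilde[~known], S[~known])[0, 1]
print(hidden_corr)
```

When every opinion is known, this function reduces to Eq. (2) with the correlation matrix of Eq. (1).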

The accuracy of our approach can be related to the "rigidity" threshold $p_2$. To illustrate this,
let us consider a situation in which $N_m = M_n$ and $N = M$. From formula (18) it turns out that the
relative error in the estimation of an opinion will be

$$\frac{2L}{\sqrt{M_n}}, \tag{19}$$

so in order to have relative errors of order one or less, the inequality $M_n > 4L^2$ must hold. This implies
for the density of known opinions among all the elements of the opinion matrix that

$$p = \frac{M_n M}{M^2} > 2 L\, p_2, \tag{20}$$

which means that our formulas work above the "rigidity" threshold $p_2$.
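A short worked example makes the condition concrete. The values of $L$ and $M$ are arbitrary; the computation only uses the inequality $M_n > 4L^2$ and the rigidity threshold $p_2 \approx 2L/M$ quoted in Section 2.

```python
L, M = 10, 10_000
M_n_min = 4 * L**2        # opinions per product needed for a relative error of order one
p2 = 2 * L / M            # rigidity threshold of Maslov and Zhang
p_min = M_n_min / M       # corresponding density of known opinions
print(M_n_min, p_min, p_min / p2)
```

With these numbers at least 400 opinions per product are required, i.e. a density of known opinions of 0.04, which is $2L = 20$ times the rigidity threshold.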

Our formalism can be generalized to systems with biased components, exploiting essentially the same
arguments used to justify Eq. (2). It is found that in this case the factor $k$ that appears in the
estimation formula (2) is given by

$$\frac{1}{k} = \frac{1}{L} + \left( \frac{L-1}{L} \right) \frac{\langle b \rangle^2}{\langle b^2 \rangle}. \tag{21}$$

Notice that $k$ does not depend on the $a_{n,l}$ variables, no matter whether these variables are biased or not.

The existence of a constant value of $k$, independent of $n$ and $m$, justifies the previously proposed
normalization approach $k = 1/s^*_{\max}$. Moreover, $k$ can be interpreted as the effective number of
components of the vector of internal preferences $b_m$. For instance, if the variance $\langle b^2 \rangle - \langle b \rangle^2$ is zero, then $b_{m,l}$
can take a unique value, so $b_m$ has only one effective degree of freedom, which is reflected by the value
$k = 1$. On the other hand, the variance of $b_{m,l}$ is maximum when $\langle b \rangle = 0$, implying the value $k = L$,
when all the $L$ degrees of freedom are relevant.
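The two limiting cases can be verified directly. The sketch below assumes that the factor $k$ has the form $1/k = 1/L + [(L-1)/L]\, \langle b \rangle^2 / \langle b^2 \rangle$, which reproduces both limits discussed above; the function name `effective_components` is hypothetical.

```python
def effective_components(L, mean_b, mean_b2):
    """Effective number of taste components k for biased tastes.

    mean_b  : average <b> of the taste components.
    mean_b2 : average <b^2> of the squared taste components.
    """
    inv_k = 1.0 / L + (L - 1) / L * mean_b**2 / mean_b2
    return 1.0 / inv_k

k_unbiased = effective_components(10, 0.0, 1.0)   # <b> = 0: all components count
k_frozen = effective_components(10, 1.0, 1.0)     # zero variance: one effective component
k_biased = effective_components(10, 0.5, 1.0)     # +/-1 tastes with P(+1) = 0.75
print(k_unbiased, k_frozen, k_biased)
```

For $\pm 1$ tastes with $P(+1) = 0.75$ one has $\langle b \rangle = 0.5$ and $\langle b^2 \rangle = 1$, so a ten-dimensional taste vector behaves as if it had only about three effective components.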

The behavior of the distance between the anticipated and actual values of opinions in the biased
case is again given by Eqs. (16) and (18), with

$$\gamma = \lambda(L) \sqrt{\langle a^2 \rangle \left[ \langle b^2 \rangle - \langle b \rangle^2 \right]}. \tag{22}$$

The asymmetry of formulas (21) and (22) with respect to the variables $a_{n,l}$ and $b_{m,l}$ is related to the
fact that the opinion correlation matrix $C$ basically reflects the overlap between the preferences of
agents. To see this, let us consider the following normalized overlap between $b_i$ and $b_j$:

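$$\Omega_{ij} = \frac{\sum_{l=1}^{L} b_{i,l}\, b_{j,l} - L \langle b \rangle^2}{L \left[ \langle b^2 \rangle - \langle b \rangle^2 \right]}. \tag{23}$$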
Figure 1: Average estimation error $\varepsilon$ for $L = 10$ as a function of the number of products $N$ for two values of $M$:
$M = 500$ (circles) and $M = 1000$ (crosses). The lines represent the best linear fits, with exponents $-0.498$ and
$-0.493$, respectively.



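For a large system size the opinion correlation matrix is written

$$C = \frac{\left( \lambda(L) B^T A - \langle s \rangle \mathbf{1} \right) \left( \lambda(L) A^T B - \langle s \rangle \mathbf{1} \right)}{N \left[ \langle s^2 \rangle - \langle s \rangle^2 \right]}. \tag{24}$$

By introducing the product $A A^T$ given in Eq. (13) into formula (24) it is found that

$$C_{ij} = \Omega_{ij} \left[ 1 + O\!\left( \frac{1}{\sqrt{N}} \right) \right], \tag{25}$$

and the average error

$$\sigma = \sqrt{\frac{1}{M^2} \sum_{m,m'} \left( C_{m,m'} - \Omega_{m,m'} \right)^2} \tag{26}$$

should scale as $\sigma \sim N^{-1/2}$.

Eq. (25) states that for increasing $N$ the correlation between the expressed opinions of agents $i$ and
$j$ tends to be equivalent to the overlap $\Omega_{ij}$.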
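The convergence of the opinion correlations towards the taste overlaps can be illustrated numerically. This sketch assumes unbiased $\pm 1$ tastes, for which the normalized overlap reduces to $b_i \cdot b_j / L$, and it measures the distance between the two matrices as a root-mean-square difference over all pairs of agents; both choices, the function name `overlap_distance` and the sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
L, M = 10, 200

def overlap_distance(N):
    """RMS difference between opinion correlations and taste overlaps."""
    A = rng.choice([-1, 1], size=(L, N))
    B = rng.choice([-1, 1], size=(L, M))
    S = (B.T @ A) / L
    C = np.corrcoef(S)               # correlations of expressed opinions
    Omega = (B.T @ B) / L            # normalized overlap for zero-mean +/-1 tastes
    return np.sqrt(np.mean((C - Omega) ** 2))

sigmas = {N: overlap_distance(N) for N in (40, 400, 4000)}
print(sigmas)
```

The printed distances shrink as the number of products increases, which is the qualitative content of the statement above: with enough products, observable correlations stand in for the unobservable overlaps.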



5 Numerical Results 

In order to test the obtained relationships, we have performed simple simulations using random data. 

The quantities $L$, $M$ and $N$ are free parameters. We have used discrete components in the set
$\{-1, 1\}$, randomly generated with variable average.

We have computed the opinion matrix $S$ (Eq. (7)), the correlation matrix $C$ (Eq. (1)) and the
actual overlap matrix $\Omega$ (Eq. (23)).

Then we have iterated over all the individuals' opinions $s_{m,n}$, computing $\tilde{s}_{m,n}$ from Eq. (2) and
accumulating the average quadratic estimation error $\varepsilon$, Eq. (16).

Figures 1, 2 and 3 show that the theoretical average errors, Eq. (16), are in good agreement with
simulations.

Moreover, we show in Figure 4 that the distance $\sigma$ between $C$ and $\Omega$ goes like $N^{-1/2}$, as expected
from Eq. (25).
Figure 2: Average estimation error $\varepsilon$ for $L = 10$ as a function of population size $M$ for two values of $N$:
$N = 500$ (circles) and $N = 1000$ (crosses). The lines represent the best linear fits, with exponents $-0.515$ and
$-0.530$, respectively.
Figure 3: Average estimation error $\varepsilon$ as a function of $L$ for $N = M = 400$. The line represents the best linear
fit, with exponent $0.523$.
Figure 4: Average error $\sigma$ (Eq. (26)) as a function of the number of products $N$. The dashed line marks
the linear fit $\sigma \sim N^{-0.41}$.



6 Discussion and conclusions 

We assumed that an opinion is formed as a scalar product between individual preferences and
products' properties (both unobservable). This assumption relies on a kind of "universality" in cognitive
processes, so that the opinion formation process should be analogous to other brain activities such as
olfaction, but honestly we do not have any rigorous justification for it. Individuals' opinions are
assumed to be stored in a database.

We have shown that, using the central limit theorem (i.e. for uncorrelated data), it is possible to anticipate
an individual's opinion, i.e. it is possible to exploit the correlations in the database to
forecast the missing opinions. Alternatively, we can obtain information about the overlap of tastes
between two individuals from the knowledge of their expressed opinions.

We have also shown that one can extract information about the dimensionality of the hidden taste
space from the opinion database. We have also recovered the (almost trivial) expectation that the
prediction error decreases when the sizes of both the individual and the product pools grow, and increases with
the dimension of the hidden space.

We have not considered here the problem of the coevolution of tastes and product qualities (which
are produced in accordance with expectations about clients' expectations). The coevolution of products'
features and individuals' preferences induces correlations: people are not expected to blindly choose
one movie from the available ones, but tend to watch movies based on their anticipated opinion,
thus filling the dataset with correlated data. On the other hand, movies are produced based on market
expectations, reducing the variability even further.

The role of education emerges from this simple model: reliable opinion anticipations, which constitute
an expectation of "revenues" from cultural investments, can come only from an assorted background
of experiences, both from a personal point of view and from the community's (due to the need
for correlations among individuals).

Finally, this model illustrates the value contained in personal information and the need for its
protection.

Experimental verifications of the model are difficult, since personal data are jealously guarded.
However, it is possible to identify similar "scalar product-like" mechanisms in chemical or biological
interactions [4], for which experimental data may be more easily available.